Lab 05: Data modelling and model selection

Introduction

This week's lab is focused on data modelling and model selection. At the end of the lab, you should be able to use scikit-learn to:

  • Create a rule-based spam classification model for SMS messages.
  • Predict whether a given SMS message is spam or not.
  • Generate a set of different candidate models and select the best one.
  • Measure the accuracy of the final model.

Getting started

Let's start by importing the packages we'll need. Like last week, we're going to use scikit-learn (sklearn), a modelling and machine learning library for Python.


In [ ]:
import itertools
import pandas as pd

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.metrics import classification_report

Next, let's load the data. Write the path to your sms.csv file in the cell below:


In [ ]:
data_file = 'data/sms.csv'

Execute the cell below to load the CSV data into a pandas data frame with the columns label and message.

Note: This week, the CSV file is tab-separated rather than comma-separated. We can tell pandas about the different format using the sep argument, as shown in the cell below. For more information, see the read_csv documentation.


In [ ]:
sms = pd.read_csv(data_file, sep='\t', header=None, names=['label', 'message'])
sms.head()

Building a spam classifier

Let's build a rule-based model that classifies SMS messages as either spam or ham by matching the content of each message against a known list of spam phrases. The cell below defines a custom scikit-learn classifier that accepts an optional list of spam phrases to match against (the spam argument) and an optional boolean (the lowercase argument) that controls whether messages are converted to lowercase before being compared with the phrases in the list.

Note: scikit-learn doesn't support generic rule-based models out of the box, so we have to code our own. However, it does support many different kinds of machine learning models, so we won't usually have to do this.


In [ ]:
class PhraseClassifier(BaseEstimator, ClassifierMixin):
    '''A rule-based spam classifier, backed by a list of spam phrases.'''

    def __init__(self, spam=(), lowercase=False):
        '''Initialises the classifier.
        
        Args:
            spam: A list of phrases used to identify spam.
            lowercase: Whether to convert messages to lowercase before predicting their class.
        '''
        self.spam = spam
        self.lowercase = lowercase

    def fit(self, X, y=None):
        '''Fits the classifier.
        
        Note: As the classifier is rule-based, this is just a dummy method to ensure
        compatibility with scikit-learn.
        
        Args:
            X: Unused.
            y: Unused.
        
        Returns:
            The classifier object (self).
        '''
        return self

    def predict(self, X, y=None):
        '''Predicts the classes of the given messages.
        
        Args:
            X: List of messages to classify.
            y: Unused.
        
        Returns:
            A list of classifications, corresponding to the given messages.
        '''
        results = []
        for message in X:
            message = message.lower() if self.lowercase else message
            cls = self.spam_or_ham(message)
            results.append(cls)
        return results

    def spam_or_ham(self, message):
        '''Classifies the given message as spam or ham.
        
        Args:
            message: The message to classify.
        
        Returns:
            The predicted class of the message: 'spam' or 'ham'.
        '''
        # Start out assuming we have ham
        result = 'ham'
        
        # If any of the phrases in self.spam match, then mark the message as spam
        for phrase in self.spam:
            if phrase in message:
                result = 'spam'
                break
        
        return result
    
    def score(self, X, y=None):
        '''Computes a score for the given messages and ground truth labels.
                
        Args:
            X: List of messages to classify.
            y: List of the true classes of the messages.
        
        Returns:
            The fraction of correct classifications made by the model (i.e. its accuracy).
        '''
        y_pred = self.predict(X, y)
        return sum([1 if y1 == y2 else 0 for y1, y2 in zip(y, y_pred)]) / len(y)

By convention, when we build a predictive model using scikit-learn, we separate our data into two variables, X and y:

  • X: This represents the data to train on or evaluate, i.e. it is a matrix consisting of explanatory variables / features.
  • y: This represents the true values of the quantity we are trying to predict, i.e. it is the target.

For instance, if we wanted to predict ice cream sales based on temperature, then we would assign the temperature data (an explanatory variable) to the X variable and the ice cream sales data (the target variable) to the y variable.
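
As a toy illustration of this convention (the numbers below are made up, not part of this lab's data):


In [ ]:
X_toy = [[18.0], [24.5], [31.0]]  # explanatory variable: temperature readings (one feature per sample)
y_toy = [120, 205, 310]           # target variable: ice cream sales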

As we are trying to predict the label (spam or ham) of SMS messages based on their content, we'll assign the messages (the explanatory variable) to the X variable and the labels (the target variable) to the y variable:


In [ ]:
X = sms['message']
y = sms['label']

We can now build our first classifier. Let's create one that marks every message as ham to start with (i.e. one with no spam phrases). It won't be very useful, but it will give us a baseline accuracy to improve on:


In [ ]:
clf = PhraseClassifier()  # clf is short for classifier
y_pred = clf.predict(X)

To measure how well our classifier works, let's create a classification report using the true labels for the SMS messages (y) and the predicted labels we've just created (y_pred):


In [ ]:
print(classification_report(y, y_pred))

While we haven't yet covered some of these terms in class, they're not difficult to understand (there's a short worked example after the list below):

  • Precision: The proportion of the classifications that were correct (e.g. correctly predicted hams / total predicted hams). Ideally, precision is 100%.
  • Recall: The proportion of a class that was correctly predicted (e.g. correctly predicted hams / total actual hams). Ideally, recall is 100%.
  • F1 score: The harmonic mean of precision and recall. This acts as a "unified" measure of accuracy. Ideally, the F1 score is 100%.
  • Support: The number of samples of the given class (e.g. the numbers of hams and spams).
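
To make these definitions concrete, here's a small worked example with made-up counts for the 'ham' class (the numbers are purely hypothetical, not taken from our dataset):


In [ ]:
tp = 90  # actual hams predicted as ham
fp = 10  # actual spams predicted as ham
fn = 30  # actual hams predicted as spam

precision = tp / (tp + fp)  # 90 / 100 = 0.90
recall = tp / (tp + fn)     # 90 / 120 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean, ~0.82

print('precision=%.2f, recall=%.2f, f1=%.2f' % (precision, recall, f1))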

We can interpret the results above like this:

  • 87% of the messages we labelled as ham were actually ham (precision for ham = 0.87).
  • None of the messages we labelled as spam were actually spam (precision for spam = 0.00).
  • We labelled every actual ham as ham (recall for ham = 1.00).
  • We labelled no actual spam as spam (recall for spam = 0.00).
  • We made predictions for 5572 messages, 4825 of which were ham and 747 of which were spam.

This isn't surprising, given that our classifier labels every message as ham and ~87% of our dataset (4825/5572) is ham.
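
If you'd like to check the class balance yourself, pandas makes it a one-line (optional) sanity check:


In [ ]:
y.value_counts(normalize=True)  # ham should come out around 0.87, spam around 0.13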

Let's make an improvement by adding the spam phrases "urgent" and "win". This is easy to do: we just have to create a new classifier and use it to make some new predictions:


In [ ]:
clf = PhraseClassifier(spam=['urgent', 'win'])
y_pred = clf.predict(X)

print(classification_report(y, y_pred))

As can be seen, the results have improved:

  • 88% of the messages we labelled as ham were actually ham (precision for ham = 0.88).
  • 51% of the messages we labelled as spam were actually spam (precision for spam = 0.51).
  • We labelled 99% of actual ham as ham (recall for ham = 0.99).
  • We labelled 9% of actual spam as spam (recall for spam = 0.09).

Our overall F1 score has also increased (from 0.80 to 0.82), which indicates that our model is performing better in general.
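
The "overall" figure here is the support-weighted average across both classes. If you'd rather compute that single number directly instead of reading it off the report, scikit-learn's f1_score metric can do so (note that f1_score isn't imported at the top of this notebook):


In [ ]:
from sklearn.metrics import f1_score

f1_score(y, y_pred, average='weighted')  # weighted-average F1 across both classes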

The PhraseClassifier class also accepts a boolean lowercase keyword argument. When it's set to True, messages are converted to lowercase before being compared with our list of spam phrases. This way, we can catch messages like "WINNER!! ..." as well as "SIX chances to win...". Let's see what effect this has on our results:


In [ ]:
clf = PhraseClassifier(spam=['urgent', 'win'], lowercase=True)
y_pred = clf.predict(X)

print(classification_report(y, y_pred))

Again, this has improved our spam detector!

Model selection

So far, we've chosen phrases that are good indicators of spam. But if we include a word that isn't such a good indicator, our model becomes worse:


In [ ]:
clf = PhraseClassifier(spam=['urgent', 'win', 'hi'], lowercase=True)
y_pred = clf.predict(X)

print(classification_report(y, y_pred))

It would be nice if we could choose a set of spam keywords that maximises the performance of our model. We can do this using model selection, i.e. building a set of different candidate models and choosing the best one. This is easy to do with scikit-learn!

Let's start by creating a sorted list of the most popular words in the spam messages in our training set:


In [ ]:
# Create a list of the words in the training set messages that are labelled as spam
spam_messages = X[y == 'spam']
spam_words = [word.lower() for message in spam_messages for word in message.split()]

# Order the spam words by popularity
top_spam_words = pd.Series(spam_words).value_counts().index.tolist()

# Print the top ten
top_spam_words[:10]

It looks like a lot of the most popular words are ones we commonly use. These aren't good indicators of spam in general, so let's remove them. scikit-learn defines a set of stop words, i.e. words that are so commonly used that they don't indicate anything in particular. If you're curious, you can peek at this list (it's a frozenset of a few hundred common English words):
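

In [ ]:
print(len(ENGLISH_STOP_WORDS))  # a few hundred words
sorted(ENGLISH_STOP_WORDS)[:10]

Let's remove these stop words from our set of most popular words: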


In [ ]:
top_spam_words = [word for word in top_spam_words if word not in ENGLISH_STOP_WORDS]

top_spam_words[:10]

This looks like a better set of words to use. Let's make a list of all the possible non-empty combinations of the most popular of these words (a brute force approach) using Python's itertools.combinations function.
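
If you haven't used itertools.combinations before, here's what it produces for a toy list (each call yields every subset of a fixed size, as tuples):


In [ ]:
list(itertools.combinations(['a', 'b', 'c'], 2))  # [('a', 'b'), ('a', 'c'), ('b', 'c')]

Now let's build the full list for our top spam words: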


In [ ]:
candidates = top_spam_words[:5]  # Use the top five words
combinations = [combination for n in range(1, len(candidates) + 1)
                for combination in itertools.combinations(candidates, n)]

print('Total combinations: %d' % len(combinations))

Now, let's define a set of parameters to build models for. The set is just a Python dictionary, where the keys match the arguments of the classifier we're using (i.e. PhraseClassifier) and the values represent different choices that can be made for a particular key:


In [ ]:
param_grid = {
    'spam': combinations,       # Try every combination of spam phrases
    'lowercase': [True, False]  # Try setting lowercase True and False
}

We can use the GridSearchCV class from scikit-learn to build every possible model defined by the parameters and choose the best one (i.e. to do model selection). Generally, when we use GridSearchCV, we will specify three parameters:

  1. A model-building algorithm.
  2. The set of parameters to use to build models.
  3. An internal cross validation technique to use to measure the accuracy of the models built using the parameters.

In this case, we will use PhraseClassifier to build the model and the set of parameters above to generate the different configurations.

We can set the cross validation technique that the grid search uses via the cv keyword argument. As we're trying to predict a categorical variable (spam or ham), and our class sizes are imbalanced (87% of the data is ham), we should make sure to stratify the selection of train and test sets to ensure similar class distributions in each. One way we can do this is with the StratifiedKFold class, which implements a stratified version of K-fold cross validation.

We can also use an outer cross validation to evaluate the error of the model selected by the grid search. For this, we need to create a second cross validator object and use it to make a set of final predictions, which we can compare to the ground truth labels to compute our overall model accuracy.

Note: In the cell below, we also set random_state=0 so that the split occurs the same way each time. This is just so this notebook runs the same way on different computers and everyone gets the same result.


In [ ]:
# Use inner CV to select the best model
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # K = 5

clf = GridSearchCV(PhraseClassifier(), param_grid=param_grid, cv=inner_cv)
clf.fit(X, y)

# Use outer CV to evaluate the error of the best model
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # K = 10, doesn't have to be the same
y_pred = cross_val_predict(clf, X, y, cv=outer_cv)

print(classification_report(y, y_pred))
print('Best parameters: %s' % clf.best_params_)

As can be seen, the best spam classifier that can be generated from the top five spam words is the one that simply checks whether a message contains the word "free" or not. If we had enough resources (time, compute power), we could determine the best spam classifier based on all of the words in the spam messages, not just the top five. The sketch below shows roughly how such a scaled-up search might be set up.
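
Note: This is a hypothetical sketch rather than part of the lab, and the choice of n = 10 is arbitrary. With n candidate words there are 2**n - 1 non-empty phrase combinations (times two for the lowercase flag), so the search space grows exponentially. The n_jobs=-1 argument asks GridSearchCV to parallelise the search across all available CPU cores.


In [ ]:
n = 10  # e.g. the top ten spam words instead of the top five
candidates = top_spam_words[:n]
combinations = [combination for k in range(1, n + 1)
                for combination in itertools.combinations(candidates, k)]

param_grid = {'spam': combinations, 'lowercase': [True, False]}
clf = GridSearchCV(PhraseClassifier(), param_grid=param_grid, cv=inner_cv, n_jobs=-1)
clf.fit(X, y)
clf.best_params_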